noise suppression
DroneAudioset: An Audio Dataset for Drone-based Search and Rescue
Gupta, Chitralekha, Ramesh, Soundarya, Sasikumar, Praveen, Yeo, Kian Peen, Nanayakkara, Suranga
Unmanned Aerial Vehicles (UAVs) or drones, are increasingly used in search and rescue missions to detect human presence. Existing systems primarily leverage vision-based methods which are prone to fail under low-visibility or occlusion. Drone-based audio perception offers promise but suffers from extreme ego-noise that masks sounds indicating human presence. Existing datasets are either limited in diversity or synthetic, lacking real acoustic interactions, and there are no standardized setups for drone audition. To this end, we present DroneAudioset (The dataset is publicly available at https://huggingface.co/datasets/ahlab-drone-project/DroneAudioSet/ under the MIT license), a comprehensive drone audition dataset featuring 23.5 hours of annotated recordings, covering a wide range of signal-to-noise ratios (SNRs) from -57.2 dB to -2.5 dB, across various drone types, throttles, microphone configurations as well as environments. The dataset enables development and systematic evaluation of noise suppression and classification methods for human-presence detection under challenging conditions, while also informing practical design considerations for drone audition systems, such as microphone placement trade-offs, and development of drone noise-aware audio processing. This dataset is an important step towards enabling design and deployment of drone-audition systems.
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- Health & Medicine (0.87)
A Unified Cortical Circuit Model with Divisive Normalization and Self-Excitation for Robust Representation and Memory Maintenance
Su, Jie, Wang, Weiwei, Gu, Zhaotian, Wang, Dahui, Qian, Tianyi
Robust information representation and its persistent maintenance are fundamental for higher cognitive functions. Existing models employ distinct neural mechanisms to separately address noise-resistant processing or information maintenance, yet a unified framework integrating both operations remains elusive -- a critical gap in understanding cortical computation. Here, we introduce a recurrent neural circuit that combines divisive normalization with self-excitation to achieve both robust encoding and stable retention of normalized inputs. Mathematical analysis shows that, for suitable parameter regimes, the system forms a continuous attractor with two key properties: (1) input-proportional stabilization during stimulus presentation; and (2) self-sustained memory states persisting after stimulus offset. We demonstrate the model's versatility in two canonical tasks: (a) noise-robust encoding in a random-dot kinematogram (RDK) paradigm; and (b) approximate Bayesian belief updating in a probabilistic Wisconsin Card Sorting Test (pWCST). This work establishes a unified mathematical framework that bridges noise suppression, working memory, and approximate Bayesian inference within a single cortical microcircuit, offering fresh insights into the brain's canonical computation and guiding the design of biologically plausible artificial neural architectures.
- North America > United States > Wisconsin (0.24)
- Asia > China > Beijing > Beijing (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)
CheapNET: Improving Light-weight speech enhancement network by projected loss function
Tan, Kaijun, Dai, Benzhe, Li, Jiakui, Mao, Wenyu
Noise suppression and echo cancellation are critical in speech enhancement and essential for smart devices and real-time communication. Deployed in voice processing front-ends and edge devices, these algorithms must ensure efficient real-time inference with low computational demands. Traditional edge-based noise suppression often uses MSE-based amplitude spectrum mask training, but this approach has limitations. We introduce a novel projection loss function, diverging from MSE, to enhance noise suppression. This method uses projection techniques to isolate key audio components from noise, significantly improving model performance. For echo cancellation, the function enables direct predictions on LAEC pre-processed outputs, substantially enhancing performance. Our noise suppression model achieves near state-of-the-art results with only 3.1M parameters and 0.4GFlops/s computational load. Moreover, our echo cancellation model outperforms replicated industry-leading models, introducing a new perspective in speech enhancement.
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
Wang, Xiaofei, Thakker, Manthan, Chen, Zhuo, Kanda, Naoyuki, Eskimez, Sefik Emre, Chen, Sanyuan, Tang, Min, Liu, Shujie, Li, Jinyu, Yoshioka, Takuya
Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.
Egocentric Audio-Visual Noise Suppression
Sharma, Roshan, He, Weipeng, Lin, Ju, Lakomkin, Egor, Liu, Yang, Kalgaonkar, Kaustubh
This paper studies audio-visual noise suppression for egocentric videos -- where the speaker is not captured in the video. Instead, potential noise sources are visible on screen with the camera emulating the off-screen speaker's view of the outside world. This setting is different from prior work in audio-visual speech enhancement that relies on lip and facial visuals. In this paper, we first demonstrate that egocentric visual information is helpful for noise suppression. We compare object recognition and action classification-based visual feature extractors and investigate methods to align audio and visual representations. Then, we examine different fusion strategies for the aligned features, and locations within the noise suppression model to incorporate visual information. Experiments demonstrate that visual features are most helpful when used to generate additive correction masks. Finally, in order to ensure that the visual features are discriminative with respect to different noise types, we introduce a multi-task learning framework that jointly optimizes audio-visual noise suppression and video-based acoustic event detection. This proposed multi-task framework outperforms the audio-only baseline on all metrics, including a 0.16 PESQ improvement. Extensive ablations reveal the improved performance of the proposed model with multiple active distractors, overall noise types, and across different SNRs.
Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition
For SNR-estimation, the input signal is transformed into so-called Amplitude Modulation Spectrograms (AMS), which rep(cid:173) resent both spectral and temporal characteristics of the respective analysis frame, and which imitate the representation of modula(cid:173) tion frequencies in higher stages of the mammalian auditory sys(cid:173) tem. A neural network is used to analyse AMS patterns generated from noisy speech and estimates the local SNR. Noise suppres(cid:173) sion is achieved by attenuating frequency channels according to their SNR. The noise suppression algorithm is evaluated in speaker(cid:173) independent digit recognition experiments and compared to noise suppression by Spectral Subtraction.
Time-Variance Aware Real-Time Speech Enhancement
Zheng, Chengyu, Zhou, Yuan, Peng, Xiulian, Zhang, Yuan, Lu, Yan
Time-variant factors often occur in real-world full-duplex communication applications. Some of them are caused by the complex environment such as non-stationary environmental noises and varying acoustic path while some are caused by the communication system such as the dynamic delay between the far-end and near-end signals. Current end-to-end deep neural network (DNN) based methods usually model the time-variant components implicitly and can hardly handle the unpredictable time-variance in real-time speech enhancement. To explicitly capture the time-variant components, we propose a dynamic kernel generation (DKG) module that can be introduced as a learnable plug-in to a DNN-based end-to-end pipeline. Specifically, the DKG module generates a convolutional kernel regarding to each input audio frame, so that the DNN model is able to dynamically adjust its weights according to the input signal during inference. Experimental results verify that DKG module improves the performance of the model under time-variant scenarios, in the joint acoustic echo cancellation (AEC) and deep noise suppression (DNS) tasks.
Microsoft Teams can now suppress background noise using machine learning
Microsoft has a couple of new audio features in the works for Teams. Machine-learning-based noise suppression can automatically detect background noise that needs to be suppressed, allowing the communication app to make speech audio clearer within meetings. An automatic music detection feature is also on the way to Teams, though people will have to wait a few months to try it out. The mode reduces bitrate by four times compared to lossless encoding. Microsoft built the new feature with music lessons and performances in mind.
End-to-End Complex-Valued Multidilated Convolutional Neural Network for Joint Acoustic Echo Cancellation and Noise Suppression
Watcharasupat, Karn N., Nguyen, Thi Ngoc Tho, Gan, Woon-Seng, Zhao, Shengkui, Ma, Bin
Echo and noise suppression is an integral part of a full-duplex communication system. Many recent acoustic echo cancellation (AEC) systems rely on a separate adaptive filtering module for linear echo suppression and a neural module for residual echo suppression. However, not only do adaptive filtering modules require convergence and remain susceptible to changes in acoustic environments, but this two-stage framework also often introduces unnecessary delays to the AEC system when neural modules are already capable of both linear and nonlinear echo suppression. In this paper, we exploit the offset-compensating ability of complex time-frequency masks and propose an end-to-end complex-valued neural network architecture. The building block of the proposed model is a pseudocomplex extension based on the densely-connected multidilated DenseNet (D3Net) building block, resulting in a very small network of only 354K parameters. The architecture utilized the multi-resolution nature of the D3Net building blocks to eliminate the need for pooling, allowing the network to extract features using large receptive fields without any loss of output resolution. We also propose a dual-mask technique for joint echo and noise suppression with simultaneous speech enhancement. Evaluation on both synthetic and real test sets demonstrated promising results across multiple energy-based metrics and perceptual proxies.
Digital stethoscope uses artificial intelligence for diagnosing lung abnormalities
Stethoscopes are a ubiquitous and cost-effective tool for medical diagnosis, but they open the door to subjectivity and can experience high levels of environmental noise. This makes it difficult to properly diagnose lung abnormalities, like COVID-19, by listening to sounds from the body. James West, at Johns Hopkins University, has been developing a digital stethoscope equipped with artificial intelligence for accurate lung diagnoses. He will discuss its opportunities and obstacles at the 179th ASA Meeting.